[Test] Add support for fractional GPU values in Ray start parameters …#4454
Draft
tiennguyentony wants to merge 7 commits into ray-project:master from
…and corresponding tests
…cceleratorResources

Critical bug fix: the addWellKnownAcceleratorResources function used strconv.FormatInt, which truncated fractional GPU values to integers. When users specified GPU resources via container.Resources.Limits (the standard Kubernetes pattern), values like 400m (0.4 GPU) were truncated to 0. This fix applies the same FormatFloat conversion used in updateRayStartParamsResources, so both code paths handle fractional GPU values correctly:
- 400m → 0.4 GPU
- 1 → 1 GPU
- 4 → 4 GPUs

Added unit test TestAddWellKnownAcceleratorResources_WithFractionalGPU to validate that the fix covers container resource limits.

Fixes issue ray-project#4447: Enable fractional GPU serving support
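The truncation described in this commit can be reproduced with the standard library alone. The sketch below is a minimal illustration, not the KubeRay code: quantityToFloat, formatTruncated, and formatFractional are hypothetical helpers, quantityToFloat is a simplified stand-in for a Kubernetes resource.Quantity that understands only plain numbers and milli-units, and it assumes the old path converted the value to int64 before calling strconv.FormatInt.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// quantityToFloat is a hypothetical, simplified stand-in for a Kubernetes
// resource.Quantity: it understands only plain numbers ("1", "4") and
// milli-units ("400m" means 400/1000 = 0.4).
func quantityToFloat(q string) (float64, error) {
	if strings.HasSuffix(q, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(q, "m"), 64)
		if err != nil {
			return 0, err
		}
		return milli / 1000, nil
	}
	return strconv.ParseFloat(q, 64)
}

// formatTruncated mimics the assumed old behaviour: the int64 conversion
// drops the fractional part before FormatInt ever sees it.
func formatTruncated(v float64) string {
	return strconv.FormatInt(int64(v), 10)
}

// formatFractional mimics the fix: FormatFloat with precision -1 emits the
// shortest string that round-trips, so whole numbers print without ".0".
func formatFractional(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

func main() {
	for _, q := range []string{"400m", "1", "4"} {
		v, err := quantityToFloat(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-5s truncated=%s fractional=%s\n", q, formatTruncated(v), formatFractional(v))
	}
}
```

Using precision -1 with the 'f' format keeps integer GPU counts such as 1 and 4 formatted exactly as before, so only fractional values change.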
…of fractional GPU resources to Ray start parameters
…PU by removing unnecessary GroupResource wrapper
…sterWithFractionalGPU
- Changed WithResources(rayv1ac.GroupResource().WithRequestedResources(...)) to WithResources(map[string]string{...})
- Fixed API usage to match the correct signature for setting resource specs in the worker group
- Added a 2-second grace period so the operator can finish its cleanup before the namespace is deleted
- Prevents a race condition in which test cleanup runs before the operator finishes its own cleanup operations
- Fixes issue ray-project#4447: Add support for fractional GPU values in Ray start parameters
…nt namespace termination race
- Added 2-second sleep before namespace deletion in TestRayClusterWithResourceQuota
- Prevents 'unable to create new content in namespace because it is being terminated' error
- Same fix as applied to TestRayClusterWithFractionalGPU
- Addresses CI test flakiness during cleanup phase
[Feature] Add support for fractional GPU values in Ray start parameters and corresponding tests
Why are these changes needed?
This PR adds support for fractional GPU values in Ray start parameters, addressing issue #4447.
Problem: Users need to serve multiple small LLMs on a single GPU using Ray's fractional GPU serving feature (e.g., 0.4 GPU per model). The autoscaler rejected fractional GPU values with the error:
"0.4 is not of type 'integer'"

Solution:
- pod.go: Changed the GPU resource conversion from int64() to float64() to support fractional values
- pod_test.go: TestUpdateRayStartParamsResources_WithFractionalGPU validates the conversion logic
- raycluster_test.go: TestRayClusterWithFractionalGPU validates end-to-end integration

This enables users to specify fractional GPU allocations such as GPU: "0.4" in their Ray placement groups for efficient multi-model serving.

Related issue number
Closes #4447
Checks
- TestUpdateRayStartParamsResources_WithFractionalGPU validates GPU conversion logic
- TestRayClusterWithFractionalGPU passes locally (1.07s)

Test Results
=== RUN TestRayClusterWithFractionalGPU
    raycluster_test.go:327: [2026-01-28] Created RayCluster for testing fractional GPU conversion
    raycluster_test.go:343: [2026-01-28] RayCluster pods created successfully
    raycluster_test.go:366: ✓ Test passed: RayCluster with fractional GPU configuration created successfully
--- PASS: TestRayClusterWithFractionalGPU (1.07s)
PASS

Changes Summary
- ray-operator/controllers/ray/common/pod.go
- ray-operator/controllers/ray/common/pod_test.go
- ray-operator/test/e2e/raycluster_test.go